Abstract

Search engines provide cached copies of indexed content so 
users will have something to "click on" if the remote resource is 
temporarily or permanently unavailable. Depending on their pro- 
prietary caching strategies, search engines will purge their indexes 
and caches of resources that exceed a threshold of unavailability. 
Although search engine caches are provided only as an aid to the 
interactive user, we are interested in building reliable preservation 
services from the aggregate of these limited caching services. But 
first, we must understand the contents of search engine caches. In 
this paper, we have examined the cached contents of Ask, Google, 
MSN and Yahoo to profile such things as overlap between index and 
cache, size, MIME type and "staleness" of the cached resources.
We also examined the overlap of the various caches with the holdings
of the Internet Archive.

Introduction 

To provide resiliency against transient errors of indexed web 
pages, most search engines (SEs) provide links to cached versions 
of many of the resources they have indexed. Unlike the Internet 
Archive (IA), these SE caches do not represent an institutional com- 
mitment to preservation. Rather, they are intended to provide a link 
to the most recently crawled version of the resource if the current 
resource is unavailable. Sometimes the caches are not of the original
resource, but of the resource migrated to a new format (e.g., PDF
to HTML).

At Old Dominion University, we are engaged in a number of 
research projects that utilize SE caches as the building blocks for 
digital preservation services. This includes the "lazy preservation" 
project [13], which uses the IA and SE caches as a preservation
strategy, and the "just-in-time preservation" project [7], which uses
the IA and SE caches to generate lexical signatures of missing resources
to aid in the discovery of new or similar versions of the
missing resource. Even though the SE caches are not "deep" like 
the IA, they are very broad and are quite useful in complementing
the IA's holdings. We know that SEs do not always immediately 
purge their caches if the original resources are unavailable. For 
example, in a previous experiment we observed in Google's cache 
resources that had been unavailable from the original source, and 
even missing from their cache, for more than two months [14].

Since much of our research relies on SE caches, we have un- 
dertaken what we believe to be the first quantitative analysis of SE 
cache behaviors and contents. We examined the caches of Ask, 
Google, MSN and Yahoo by issuing dictionary-based queries to 
the SEs and characterizing the caches of the returned resources.
In particular, we measured mean file size, file MIME type, age of 
the cached resource, cache errors (i.e., the resource is declared as 
cached but not retrievable) and "cache-only" availability (i.e., the 
original resource is unavailable as indexed and the cached version 
is available). We also measured the overlap of the SE caches with 
the IA and computed the "staleness" of the cached resources. Fi- 
nally, we uncovered a number of cases where the SEs had cached 
what they arguably should not have (e.g., resources with a "Cache- 
Control: Private" header). 

Background and Related Work 

Figure [1] shows the results of searching Google for "Archiving
2005". The first result is what we expect, and next to the URL is the
link labeled "Cached". Clicking this link, we see the results shown
in Figure [2]. This cached version has a datestamp of March 3, 2007.
This is typical of SE caches in that only the most recently cached 
version is available. 

Figure 1. Google search for "Archiving 2005".



In contrast, Figure [3] shows the multiple datestamped versions 
available from the Internet Archive. The IA, which first began 
archiving the Web in 1996 [9], is unique in its mission of crawling
and archiving everything, with no specific accession policy. Although
the IA is a tremendous public service, it has some significant
limitations. The first is that the Alexa crawler (Alexa Internet
does the crawling for IA) can be slow to visit a site. In a
previous study, we noted that despite our request that it be archived, the
Alexa crawler never visited our site in over 100 days (at which point
we stopped checking) [14]. The second limitation is that even after
the site is crawled, the IA will not make the resources accessible
until 6-12 months have passed [8]. In summary, the IA can be
a great boon, but it can be slow to acquire resources, and it
might not have found them at all.

Besides a study by Lewandowski et al. [10], which examined
the freshness of 38 German web pages in SE caches, we are un- 
aware of any research that has characterized the SE caches or at- 
tempted to find the overlap of SE caches with the IA. 

Methodology 

We chose to study four popular search engines that cache 
content: Google, MSN, Yahoo, and Ask. We used the web search 
APIs provided by Google, MSN, and Yahoo for accessing their 
search results and page scraping for accessing Ask's search results,
since Ask does not provide an API. Although we have discovered in
previous work that the search engine APIs do not always produce
the same results as the web user interface [12], we used the APIs
because Google and Yahoo will block access to clients that issue
too many queries [11].

Figure 2. Google's cached copy with datestamp of March 3, 2007.

In February 2006, we issued 5200 one-term queries (randomly 
sampled from an English dictionary) to each search engine and ran- 
domly chose one of the first 100 results. We attempted to download 
the selected URL from the Web and also the cached resource from 
the SE. We also queried the IA to see how many versions of the 
URL it had stored for each year, if any. All SE responses, http 
headers, web pages, cached pages and IA responses were stored for 
later processing. 
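
The sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: `search` is a hypothetical callable standing in for a search engine API or page scraper, and the function and parameter names are invented for the example.

```python
import random

def sample_urls(dictionary, search, num_queries, top_n=100, seed=None):
    """Pick one random result from the top_n hits of randomly chosen one-term queries.

    `search(term, limit)` is a hypothetical stand-in that returns a list of URLs.
    """
    rng = random.Random(seed)
    # Draw distinct one-term queries from the dictionary.
    terms = rng.sample(dictionary, num_queries)
    sampled = []
    for term in terms:
        results = search(term, top_n)[:top_n]
        if results:  # a term may return no hits at all
            sampled.append((term, rng.choice(results)))
    return sampled
```

Each sampled URL would then be fetched from the Web, from the SE cache, and looked up in the IA, with all responses stored for later processing.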

Our sampling method introduces several biases: it favors
pages in English, long and content-rich pages (which are more likely
to match a query than smaller documents), and pages that are
more popular than others. New methods [2] have recently been
developed to reduce these biases when sampling from SE indexes 
and could be used in future experiments. 

Cache Content 

We first examine the sampled cache contents and their distri- 
bution by top level domain, MIME type and size. We also examine 
the use of noarchive meta tags and http cache-control directives 
for keeping content out of SE caches. 

Cache and Web Overlap 

In Table [1] we see the percentage of resources from each SE that
were cached or not. Within these categories, we break out those
resources that were accessible on the Web or missing (HTTP 4xx or
5xx response, or timed out). Less than 9% of Ask's indexed contents
were cached, but the other three search engines had at least 80% of 
their content cached. Over 14% of Ask's indexed content could not 
be successfully retrieved from the Web, and since most of these re- 
sources were not cached, the utility of Ask's cache is questionable. 
Google, MSN and Yahoo had far less missing content indexed, and 
a majority of it was accessible from their cache. 

The miss rate column in Table [1] is the percent of time the
search engines advertised a link to a cached resource but returned an 
error page when the cached resource was accessed. Ask and MSN 
appear to have the most reliable cache access (although Ask's cache 
is very small). Note that Google's miss rate is probably higher be- 
cause Google's API does not advertise a link to the cached resource; 
the only way of knowing if a resource is cached or not is to attempt 
to access it. 

Table 1. Web and Cache Overlap

Figure 3. Archived copies of Archiving 2005 web page.



Top Level Domain 

Figure [4] shows the distribution of the top level domains
(TLDs) of the sampled URLs from each search engine's index (only
the top 15 are shown). Our findings are very similar to the distributions
in [2]. All four search engines tend to sample equally from
the same TLDs with .com being the largest by far.

Figure 5. Distribution of Web file sizes (left) and cached file sizes (right) on log-log scale. Web file size means: Ask = 88 KB, Google = 244 KB, MSN = 204 KB,
Yahoo = 61 KB. Cached file size means: Ask = 74 KB, Google = 104 KB, MSN = 79 KB, Yahoo = 34 KB. 



Content Type 

Table [2] shows the distribution of resources sampled from each
search engine's index (Ind column). The percent of those resources 
that were extracted successfully from cache is given under the Cac 
column. HTML was by far the most indexed of all resource types. 
Google, MSN and Yahoo provided a relatively high level of access 
to all cached resources, but only 10% of HTML and 11% of plain 
text resources could be extracted from Ask's cache, and no other 
content type was found in their cache. 

Table 2. Indexed and Cached Content by Type

We found several media types indexed (but not cached) that
we did not expect. We discovered two videos in Google using the
Advanced Systems Format (ASF) and an audio file (MPEG) and
Flash file indexed by Yahoo. Several XML resource types were
also discovered (and some cached): XML Shareable Playlist (Ask),
Atom (Google) and RSS feeds (Ask, Google and Yahoo), and OAI-PMH
responses (Ask and Google). We did not find any XML types
in MSN.

Figure 4. Distribution of TLDs from sample.

File Sizes 

In Figure [5] we plot the file size distribution of the live web resources
(left) and cached resources (right). The graphs use log-log
scale to emphasize the power-law distribution of page sizes which
has been observed on the Web [1]. Before calculating the cached
resource size, we stripped each resource of the SE header. All four 
SEs appeared to limit the size of the resources they cache. The limits observed
were: Ask: 976 KB, Google: 977 KB, MSN: 1 MB and Yahoo: 
215 KB. The caching limits affected approximately 3% of all re- 
sources cached. On average, Google and MSN indexed and cached 
the largest web resources. 

Cache Directives 

SEs and the IA use an opt-out policy approach to caching and 
archiving. All crawled resources are cached unless a web master 
uses the Robots Exclusion Protocol (robots.txt) to indicate URL 
patterns that should not be indexed (which also prevents them from 
being cached) or if noarchive meta tags are placed in HTML 
pages. There is currently no mechanism in place to permit an SE
to index a non-HTML resource but not cache it. 

We found 2% of the HTML resources from the Web used 
noarchive meta tags. Only 6% specifically targeted googlebot, 
and 96% targeted all robots (none were targeting the other three 
SEs). We found only a handful of resources with noarchive
meta tags that were cached by Google and Yahoo, but it is likely 
the tags were added after the SE crawlers had downloaded the re- 
sources since none of the tags were found in the cached resources. 
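
As an illustration of how noarchive meta tags might be detected in a downloaded page, the following is a minimal sketch using Python's standard `html.parser` module. The class and function names are assumptions made for this example and are not taken from the study.

```python
from html.parser import HTMLParser

class NoArchiveParser(HTMLParser):
    """Collect the robot names whose meta tag content includes 'noarchive'."""

    def __init__(self):
        super().__init__()
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when an attribute has no value.
        attr = {k.lower(): (v or "") for k, v in attrs}
        directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
        if "noarchive" in directives:
            self.targets.add(attr.get("name", "").strip().lower())

def noarchive_targets(html):
    """Return the set of meta tag names (e.g. 'robots', 'googlebot') requesting noarchive."""
    parser = NoArchiveParser()
    parser.feed(html)
    return parser.targets
```

A result containing "robots" indicates a tag aimed at all crawlers, while "googlebot" indicates a tag aimed at Google's crawler only.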

HTTP 1.1 has a number of cache-control directives that are 
used to indicate if the requested resource is to be cached, and if 
so, for how long. Whether or not these directives apply to search 
engines and web archives is a point of contention [3]. One quarter
(24%) of the sampled resources had an http header with Cache- 
Control set to no-cache, no-store or private, and 62% of these re- 
sources were cached. None of the SEs appeared to respect the 
cache-control directives since all four SEs cached these resources 
at the same rate as resources without the header. 
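
A check for the three restrictive directives mentioned above could look like the sketch below. It assumes the response headers are available as a plain mapping of header names to values; the names `RESTRICTIVE` and `restrictive_directives` are hypothetical, and the comma split is a simplification that does not handle every quoted-string form allowed by HTTP.

```python
RESTRICTIVE = {"no-cache", "no-store", "private"}

def restrictive_directives(headers):
    """Return the restrictive Cache-Control directives present in a header mapping."""
    value = ""
    for name, val in headers.items():
        if name.lower() == "cache-control":  # header names are case-insensitive
            value = val
            break
    found = set()
    for part in value.split(","):
        # Drop any argument, e.g. no-cache="Set-Cookie" becomes no-cache.
        directive = part.strip().split("=", 1)[0].strip().lower()
        if directive in RESTRICTIVE:
            found.add(directive)
    return found
```

A non-empty result for a resource that is nevertheless available from an SE cache corresponds to the case counted in this section.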



Cache Freshness 

We next examine the freshness of the SE caches. A cached 
copy of a Web resource is fresh if the Web resource has not changed 
since the last time it was crawled and cached. Once a resource has 
been modified, the cached resource becomes stale (or ages [5]). The
staleness of the cache increases until the SE re-crawls the resource 
and updates its cache. 

To measure the staleness of the caches, we examined the Last-
Modified http header of the live resource from the Web and the date 
from the cached resource. Although some servers do not return last 
modified dates (typically for dynamically produced resources) or 
return incorrect values [6], it is the best we can do to determine
when the resource was last modified. Not all cached resources con- 
tain cache dates either; Google only reports cached dates for HTML 
resources, and Yahoo only reports last modified dates through their 
API. We calculated staleness (in days) by subtracting the cached
date from the last modified date. If the cache date was more recent
than the last modified date, we assigned a value of 0 to staleness.
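
The staleness calculation can be sketched as follows, assuming the Last-Modified value is a standard HTTP date string and the cache date has already been parsed into a timezone-aware `datetime`. The function name `staleness_days` is an assumption made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def staleness_days(last_modified_header, cached_date):
    """Days by which the live resource's Last-Modified date exceeds the cache date.

    Returns 0 when the cached copy is at least as recent as the last modification.
    """
    last_modified = parsedate_to_datetime(last_modified_header)
    if last_modified.tzinfo is None:
        # Treat dates without a usable zone as UTC (an assumption of this sketch).
        last_modified = last_modified.replace(tzinfo=timezone.utc)
    delta = last_modified - cached_date
    return max(delta.days, 0)
```

For example, a page last modified on March 3, 2007 whose cached copy is dated February 20, 2007 would be 11 days stale.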

Only 46% of the live pages had a valid http Last-Modified 
timestamp, and of these, 71% also had a cached date. We found 
84% of the resources were up-to-date. The descriptive statistics for 
resources that were at least one day stale are given in Table [3] and
the distribution is shown in Figure [6]. Although Google had the
largest number of stale cached pages, Yahoo's pages were on average
more stale. MSN had the fewest stale pages and
nearly the most up-to-date set of pages.

Table 3. Staleness of Search Engine Caches (in Days)

Figure 6. Distribution of staleness on log-log scale.

We also wanted to know how similar the cached resources 
were compared to the live resources from the Web. We would ex- 
pect up-to-date cached resources to be identical or nearly identical 
to their Web counterparts. We would also expect web resources in
formats that get converted into HTML (e.g., PDF, PostScript and 
Microsoft Office) to be very similar to their cached counterparts in 
terms of word order. When comparing live resources to crawled resources,
we counted the number of shared shingles (of size 10) between
the two documents after stripping out all HTML (if present).
Shingling [4] is a popular technique for quantifying similarity of
text documents when word-order is important.
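
A minimal sketch of a shingle comparison is given below. Two points are assumptions of this example rather than details stated in the text: the HTML stripping is a crude regular expression, and the percentage is normalized by the union of the two shingle sets (Jaccard resemblance), since the exact normalization is not specified above.

```python
import re

def shingles(text, size=10):
    """Return the set of contiguous word sequences (shingles) of the given size."""
    # Crude tag removal, then lowercase and split on whitespace.
    words = re.sub(r"<[^>]+>", " ", text).lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def shared_shingle_percent(live, cached, size=10):
    """Percent of the distinct shingles in either document that appear in both."""
    a, b = shingles(live, size), shingles(cached, size)
    if not a and not b:
        return 100.0
    return 100.0 * len(a & b) / len(a | b)
```

Because each shingle is an ordered run of words, reordering a passage lowers the score even when the vocabulary is unchanged.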

We found that 19% of the cached resources were identical to 
their live counterparts, 21% if examining just HTML resources. On 
average, resources shared 72% of their shingles. This implies that 
although most web resources are not replicated in caches byte-for- 
byte, most of them are very similar to what is cached. 

In Figure [7] we have plotted each resource's 'similarity' value
(percent of shared shingles) vs. its staleness. The busy scatterplots
indicate there is no clear relationship between similarity and staleness;
a cached resource is likely to be just as similar to its live Web
counterpart whether it is one or 100 days stale.

Figure 7. Scatterplots of similarity vs. staleness (x-axis is on a log scale).



Overlap with Internet Archive 

We were interested in knowing how much of the content indexed and
cached by the four SEs was also archived by the Internet Archive.
Figure [8] shows a Venn diagram illustrating how some resources
held by the IA are indexed by SEs (I) and/or cached (II). But there
are some indexed (IV) and cached (III) resources that are not available
in the IA.

Table [4] shows the overlap of sampled URLs within IA. MSN 
had the largest overlap with IA (52%) and Yahoo the smallest 
(41%). On average, only 46% of the sampled URLs from all four 
SEs were available in IA. 

Table 4. Internet Archive Overlap

In Figure [9] we have plotted the distribution of the archived
resources which shows an almost exponential increase each year.
We suspect there were far fewer resources in 2006 since the IA is
6-12 months out of date. The hit-rate line in Figure [9] is the percent
of time the IA had at least one resource archived for that year. It is
interesting to note that although the number of resources archived
in 2006 was half that of 2004, the hit rate of 29% almost matched
2004's 33% hit rate.

Figure 8. Venn diagram showing overlap of SE caches with IA.

Conclusions 

In this study, we have characterized the caches of Ask, Google, 
MSN and Yahoo by randomly choosing results from the top 100 
hits based on dictionary-based queries. From a digital preservation 
perspective, Ask was of limited utility; it had the fewest resources 
cached (9%), and although 14% of the resources it had indexed 
were unavailable from the Web, only 3% of them were accessible 
from its cache. The resources from Google (80%), MSN (93%)
and Yahoo (80%) were cached much more frequently, and all had 
limited cache miss rates. Top level domains appear to be repre- 
sented in all four SE caches with roughly the same distribution. We 
found noarchive meta tags were infrequently used (2%) in sam- 
pled HTML resources, and SEs did not appear to respect http cache- 
control headers, two advantages from a preservation perspective. 

Search engines primarily index HTML, but of the resources 
that are indexed, all SEs but Ask cached non-HTML resources with 
about the same frequency. All SEs seemed to have an upper bound 
on cached resources of about 1 MB except for Yahoo which appears 
to have an upper bound of 215 KB; this only affected 3% of all 
cached resources. The proportion of stale cached resources ranged
from 12% (MSN) to 20% (Google), and median staleness ranged
from 5 days (MSN) to 17 days (Yahoo). 

While the IA provides a preservation service for public web
pages, its well-known limitations of crawling frequency and 6-12
month delay in processing crawled resources limit its effectiveness.
We found that the IA contained only 46% of the resources
available in SE caches. More importantly, the proportion of resources
available in neither an SE cache nor the IA is quite low: 4% for
MSN, 5% for Google and 11% for Yahoo. Again, Ask (55%) performs
poorly.

Search engines provide access to cached copies as a secondary 
service to guard against temporary unavailability of the indexed re- 
sources. But given enough SEs and their respective scale, it be- 
comes possible (especially in combination with the IA) to build so- 
phisticated digital preservation services using SE caches. 